# Multimodal Models

| Model | License | Developer | Description | Tags | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| PP-Chart2Table | Apache-2.0 | PaddlePaddle | Multimodal model from the PaddlePaddle team focused on Chinese and English chart parsing; efficiently converts charts into data tables. | Image-to-Text, Multilingual | 1,392 | 0 |
| Gemma 3 4B IT QAT GGUF | — | unsloth | Gemma 3 is a lightweight, state-of-the-art open model family from Google, built on the same research and technology behind the Gemini models. This model is multimodal, taking text and image inputs and generating text outputs. | Text-to-Image, English | 2,629 | 2 |
| LLM-jp CLIP ViT Base Patch16 | Apache-2.0 | llm-jp | Japanese CLIP model trained with the OpenCLIP framework, supporting zero-shot image classification. | Text-to-Image, Japanese | 40 | 1 |
| PaliGemma LongPrompt v1 Safetensors | GPL-3.0 | mnemic | Experimental vision model that combines keyword tags with long text descriptions to generate image prompts. | Image-to-Text, Transformers | 38 | 1 |
| PaliGemma 3B Mix 448 FT TableDetection | — | ucsahin | Multimodal table-detection model fine-tuned from google/paligemma-3b-mix-448, specialized in locating table regions in images. | Image-to-Text, Transformers | 19 | 4 |
| PaliGemma Rich Captions | Apache-2.0 | gokaygokay | Image-captioning model fine-tuned from PaliGemma-3B on the DocCI dataset; generates detailed descriptions of 200–350 characters with reduced hallucination. | Image-to-Text, Transformers, English | 66 | 9 |
| Compare2Score | MIT | q-future | Image quality assessment model that assigns a quality score to an input image. | Image Enhancement, Transformers | 391 | 4 |
| ViT-Medium Patch16 CLIP 224 (TinyCLIP, YFCC-15M) | MIT | timm | CLIP model with a ViT backbone for zero-shot image classification. | Image Classification | 144 | 0 |
| SigLIP Large Patch16-384 | Apache-2.0 | google | SigLIP is a multimodal model pretrained on the WebLI dataset with an improved sigmoid loss, suited for zero-shot image classification and image-text retrieval. | Image-to-Text, Transformers | 245.21k | 6 |
| SigLIP Large Patch16-256 | Apache-2.0 | google | Vision-language model pretrained on WebLI with an improved sigmoid loss that boosts image-text matching performance. | Image-to-Text, Transformers | 24.13k | 12 |
| SigLIP Base Patch16-512 | Apache-2.0 | google | Vision-language model pretrained on WebLI with an improved sigmoid loss; strong at image classification and image-text retrieval. | Text-to-Image, Transformers | 237.79k | 24 |
| Chinese-CLIP ViT-Large Patch14 | — | Xenova | Chinese CLIP model with a Vision Transformer backbone, supporting cross-modal understanding and retrieval between Chinese text and images. | Text-to-Image, Transformers | 14 | 0 |
| SigLIP Base Patch16-224 | Apache-2.0 | google | Vision-language model pretrained on WebLI with an improved sigmoid loss, optimized for image-text matching. | Image-to-Text, Transformers | 250.28k | 43 |
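
Several of the CLIP- and SigLIP-style checkpoints above are typically used for zero-shot image classification. A minimal sketch with the `transformers` zero-shot image classification pipeline, assuming the `google/siglip-base-patch16-224` checkpoint listed above (the image path and candidate labels are placeholders):

```python
from PIL import Image
from transformers import pipeline

# Zero-shot image classification with a SigLIP checkpoint.
# The model id corresponds to the "SigLIP Base Patch16-224" entry above;
# the image path and labels are illustrative placeholders.
classifier = pipeline(
    "zero-shot-image-classification",
    model="google/siglip-base-patch16-224",
)

image = Image.open("example.jpg")
candidate_labels = ["a bar chart", "a line chart", "a photo of a cat"]

# The pipeline returns one {"label": ..., "score": ...} dict per candidate label.
for result in classifier(image, candidate_labels=candidate_labels):
    print(f"{result['label']}: {result['score']:.3f}")
```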
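
The "improved sigmoid loss" mentioned in the SigLIP descriptions means each image-text pair is scored independently with a sigmoid rather than normalized with a softmax across a batch, which is what makes these checkpoints convenient for image-text retrieval. A rough sketch of the lower-level usage, again assuming `google/siglip-base-patch16-224` and placeholder inputs:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "google/siglip-base-patch16-224"
model = AutoModel.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")            # placeholder image
texts = ["a bar chart", "a photo of a cat"]  # placeholder candidate texts

inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Each image-text pair gets an independent probability via a sigmoid,
# rather than a softmax over all candidates as in the original CLIP.
probs = torch.sigmoid(outputs.logits_per_image)
print(probs)
```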
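
The PaliGemma-based entries (table detection, rich captions) are image-to-text models fine-tuned from a common base. A sketch of how a PaliGemma checkpoint is loaded and prompted with `transformers`, assuming the gated `google/paligemma-3b-mix-448` base model referenced in the table-detection entry; the image path and prompt are placeholders, and the fine-tuned derivatives are loaded the same way with their own repo ids:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Base checkpoint referenced by the table-detection entry above; the repo is
# gated, so its license must be accepted on the model page before downloading.
model_id = "google/paligemma-3b-mix-448"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("page.png")   # placeholder input image
prompt = "caption en"            # the mix checkpoints expect short task prompts

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the echoed prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(processor.decode(generated[0][prompt_len:], skip_special_tokens=True))
```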